Diary 2024-01-14 - NISHIO Hirokazu's Scrapbox (Auto-translated from Japanese)

Diary 2024-01-14

The naive vector similarity method separates English and Japanese.

ベクトル空間で英語と日本語は分離してる.icon

For me personally, this is a big problem, so I'm not very satisfied with a simple embedded vector similarity

But very few people use both English and Japanese, so you don't have to worry about it.

But when I went to bed and woke up, I felt like, "Why don't we machine translate every English sentence into Japanese and load it into a vector index?

Better than making it into a vector and then thinking "how do I paste together the spaces that are far apart"?

In Plurality Japanese translation, the language is now included in the vector index without separating the languages.

/plurality-japanese/Plurality Vector Search

I feel like I can't provide user value with that.

If it is Japanese who use it, the query is also usually in Japanese

But when I search for idioms, I get Chinese hits.

If so, it would be better to load the "Japanese" side of "other language -> Japanese" into the search index.

Then, turn the arrow in the opposite direction and read the original text after the Japanese hits if you want to.

The association of two chunks by machine translation is a kind of "link"

Hits by vector search are loose links

Compare with explicit human links in Scrapbox

Link meaning "same content, different language."

Scrapbox link to "Human Associations" link

What to record

Diary 2024-01-13 ← Diary 2024-01-14 → Diary 2024-01-15

100 days ago Diary 2023-10-06.

1 year ago Diary 2023-01-14.

---

This page is auto-translated from /nishio/日記2024-01-14 using DeepL. If you looks something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thought to non-Japanese readers.